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Abstract —We propose a convolutional neural network (CNN) architec¬ 
ture for facial expression recognition. The proposed architecture is inde¬ 
pendent of any hand-crafted feature extraction and performs better than 
the earlier proposed convolutional neural network based approaches. 
We visualize the automatically extracted features which have been 
learned by the network in order to provide a better understanding. The 
standard datasets, i.e. Extended Cohn-Kanade (CKP) and MMI Facial 
Expression Databse are used for the quantitative evaluation. On the 
CKP set the current state of the art approach, using CNNs, achieves 
an accuracy of 99 . 2 %. For the MMI dataset, currently the best accuracy 
for emotion recognition is 93 . 33 %. The proposed architecture achieves 
99.6% for CKP and 98.63% for MMI, therefore performing better than the 
state of the art using CNNs. Automatic facial expression recognition has 
a broad spectrum of applications such as human-computer interaction 
and safety systems. This is due to the fact that non-verbal cues are 
important forms of communication and play a pivotal role in interper¬ 
sonal communication. The performance of the proposed architecture 
endorses the efficacy and reliable usage of the proposed work for real 
world applications. 


1 Introduction 

Humans use different forms of communications such 
as speech, hand gestures and emotions. Being able to 
understand one's emotions and the encoded feelings 
is an important factor for an appropriate and correct 
understanding. 

With the ongoing research in the field of robotics, es¬ 
pecially in the field of humanoid robots, it becomes inter¬ 
esting to integrate these capabilities into machines allow¬ 
ing for a more diverse and natural way of communica¬ 
tion. One example is the Software called EmotiChat [1]. 
This is a chat application with emotion recognition. The 
user is monitored and whenever an emotion is detected 
(smile, etc.), an emoticon is inserted into the chat win¬ 
dow. Besides Human Computer Interaction other fields 
like surveillance or driver safety could also profit from 
it. Being able to detect the mood of the driver could help 
to detect the level of attention, so that automatic systems 
can adapt. 



Fig. 1: Example images from the MMI (top) and CKP 
(bottom). The emotions from left to right are: Anger, 
Sadness, Disgust, Happiness, Fear, Surprise. The emotion 
Contempt of the CKP set is not displayed. 


Many methods rely on extraction of the facial region. 
This can be realized through manual inference [2] or 
an automatic detection approach [1]. Methods often 
involve the Facial Action Coding System (FACS) which 
describes the facial expression using Action Units (AU). 
An Action Unit is a facial action like "raising the Inner 
Brow". Multiple activations of AUs describe the facial 
expression [3]. Being able to correctly detect AUs is a 
helpful step, since it allows making a statement about 
the activation level of the corresponding emotion. 
Handcrafted facial landmarks can be used such as done 
by Kotsia et al. [2]. Detecting such landmarks can be 
hard, as the distance between them differs depending 
on the person [4]. Not only AUs can be used to detect 
emotions, but also texture. When a face shows an 
emotion the structure changes and different filters can 
be applied to detect this [4]. 

The presented approach uses Artificial Neural 
Networks (ANN). ANNs differ, as they are trained 
on the data with less need for manual interference. 
Convolutional Neural Networks are a special kind of 
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ANN and have been shown to work well as feature 
extractor when using images as input [5] and are 
real-time capable. This allows for the usage of the raw 
input images without any pre- or postprocessing. 
GoogleNet [6] is a deep neural network architecture that 
relies on CNNs. It has been introduced during the Image 
Net Large Scale Visual Recognition Challenge(ILSVRC) 
2014. This challenge analyses the quality of different 
image classification approaches submitted by different 
groups. The images are separated into 1000 different 
classes organized by the WordNet hierarchy. In the 
challenge "object detection with additional training 
data" GoogleNet has achieved about 44% precision [7]. 
These results have demonstrated the potential which 
lies in this kind of architecture. Therefore it has been 
used as inspiration for the proposed architecture. 

The proposed network has been evaluated on the 
Extended Cohn-Kanade Dataset (Section 4.2) and on 
the MMI Dataset (Section 4.1). Typical pictures of 
persons showing emotions can be seen in Fig. 1. The 
emotion Contempt of the CKP set is not shown as no 
subject with consent for publication and an annotated 
emotion is part of the dataset. Results of experiments 
on these datasets demonstrate the success of using a 
deep layered neural network structure. With a 10-fold 
cross-validation a recognition accuracy of 99.6% has 
been achieved. 

The paper is arranged as follows: After this intro¬ 
duction, Related Work (Section 2) is presented which 
focuses on Emotion/Expression recognition and the var¬ 
ious approaches scientists have taken. Next is Section 3, 
Background, which focuses on the main components 
of the architecture proposed in this article. Section 4 
contains a summary of the used Datasets. In Section 5 
the architecture is presented. This is followed by the 
experiments and its results (Section 6) . Finally, Section 8 
summarizes the article and concludes the article. 

2 Related Work 

A detailed overview for expression recognition was 
given by Caleanu [8] and Bettadapura [9]. In this Section 
mainly work which similar to the proposed method is 
presented as well as few selected articles which give a 
broad overview over the different methodologies. 

Recently Szegedy et al.[6] have proposed an 
architecture called GoogLeNet. This is a 27 layer 
deep network, mostly composed of CNNs. The network 
is trained using stochastic gradient descent. In the 
ILSVRC 2014 Classification Challenge this network 
achieved a top-5 error rate of 6.67% winning the first 
place. 

Using the the Extended Cohn-Kanade Dataset 
(Section 4.2), Happy and Routray [4] classify between 
six basic emotions. Given an input image, their solution 
localizes the face region. From this region, facial patches 


e.g. the eyes or lips are detected and points of interest 
are marked. From the patches which have the most 
variance between two images, features are extracted. 
The dimensionality of the features is reduced and then 
given to a Support Vector Machine (SVM). To evaluate 
the method, a 10-fold cross-validation is applied. The 
average accuracy is 94.09%. 

Video based emotion recognition has been proposed 
by Byeon and Kwak [10]. They have developed a three 
dimensional CNN which uses groups of 5 consecutive 
frames as input. A database containing 10 persons has 
been used to achieve an accuracy of 95%. 

Song et al. [11] have used a deep convolutional neural 
network for learning facial expressions. The created 
network consists of five layers with a total of 65k 
neurons. Convolutional, pooling, local filter layers 
and one fully connected layer are used to achieve an 
accuracy of 99.2% on the CKP set. To avoid overfitting 
the dropout method was used. 

Luecy et al. [12] have created the Extended Cohn- 
Kanade dataset. This dataset contains emotion 
annotations as well as Action Unit annotations. In 
regards to classification, they also have evaluated the 
datasets using Active Appearance Models (AAMs) in 
combination with SVMs. To find the position and track 
the face over different images, they have employed 
A AM which generates a Mesh out of the face. From this 
mesh they have extracted two feature vectors. First, the 
normalized vertices with respect to rotation, translation, 
and scale. Second a gray-scale image from the mesh 
data, and the input images has been extracted. They 
have chosen a cross-validation strategy, where one 
subject is left out in the training process, achieving an 
accuracy of over 80%. 

Anderson et al. [1] have developed a face expression 
system, which is capable of recognizing the six basic 
emotions. Their system is built upon three components. 
The first one is a face tracker (derivative of ratio 
template) to detect the location of the face. The second 
component is an optical flow algorithm to track the 
motion within the face. The last component is the 
recognition engine itself. It is based upon Support 
Vector Machines and Multilayer Perceptrons. This 
approach has been implemented in EmotiChat. They 
achieve a recognition accuracy of 81.82%. 

Kotsia and Pitas [2] detect emotions by mapping a 
Candide grid, a face mask with a low number of 
polygons, onto a person's face. The grid is initially 
placed randomly on the image, then it has to be 
manually placed on the persons face. Throughout the 
emotion, the grid is tracked using a KanadeLucasTomasi 
tracker. The geometric displacement information 
provided by the grid is used as feature vector for 
multiclass SVMs. The emotions are anger, disgust, fear, 
happiness, sadness, and surprise. They evaluate the 
model on the Cohn-Kanade dataset and an accuracy of 
99.7% has been achieved. 

Shan et al. [13] have created an emotion recognition 
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system based on Local Binary Patterns (LBP). The 
LBPs are calculated over the facial region. From the 
extracted LBPs a feature vector is derived. The features 
depend on the position and size of the sub-regions 
over witch the LBP is calculated. AdaBoost is used to 
find the sub-regions of the images which contain the 
most discriminative information. Different classification 
algorithms have been evaluated of which an SVM 
with Boosted-LBP features performs the best with a 
recognition accuracy of 95.1% on the CKP set. 

In 2013 Zafar et al. [14] proposed an emotion recognition 
system using Robust Normalized Cross Correlation 
(NCC). The used NCC is the "Correlation as a Rescaled 
Variance of the Difference between Standardized 
Scores". Outlier pixels which influence the template 
matching too strong or too weak are excluded and 
not considered. This approach has been evaluated 
on different databases including AR FaceDB (85% 
Recognition Accuracy) and the Extended Cohn Kanade 
Database (100% Recognition Accuracy). 


3 Convolutional Neural Networks 

Convolutional Layer: Convolutional Layers perform 
a convolution over the input. Let fk be the filter with a 
kernel size n x m applied to the input x. n x m is the 
number of input connections each CNN neuron has. The 
resulting output of the layer calculates as follows: 

cixu,v) = y] y fk{i,j)xu-i,v-j (1) 

i=— ^ j=— ^ 

To calculate a more rich and diverse representation of 
the input, multiple filters fk with /c G N can be applied 
on the input. The filters fk are realized by sharing 
weights of neighboring neurons. This has the positive 
effect that lesser weights have to be trained in contrast to 
standard Multilayer Perceptrons, since multiple weights 
are bound together. 

Max Pooling: Max Pooling reduces the input by 
applying the maximum function over the input Xi. Let 
m be the size of the filter, then the output calculates as 
follows: 

TTl TD 

M{xi) = max{x,+,,,+, | |A:| < -, |/| < - A:, / G N} (2) 

This layer features translational invariance with re¬ 
spect to the filter size. 

Rectified Linear Unit: A Rectified Linear Unit (ReLU) 
is a cell of a neural network which uses the following 
activation function to calculate its output given x: 

R{x) = max{0,x) (3) 


Using these cells is more efficient than sigmoid and 
still forwards more information compared to binary 
units. When initializing the weights uniformly, half of 
the weights are negative. This helps creating a sparse 
feature representation. Another positive aspect is the 
relatively cheap computation. No exponential function 
has to be calculated. This function also prevents the 
vanishing gradient error, since the gradients are linear 
functions or zero but in no case non-linear functions [15]. 

Fully Connected Layer: The fully connected layer also 
known as Multilayer Perceptron connects all neurons of 
the prior layer to every neuron of its own layer. Let the 
input be x with size k and I be the number of neurons in 
the fully connected layer. This results in a Matrix Wixk- 

F{x) = (j{W * x) (4) 

cr is the so called activation function. In our network 
CF is the identity function. 

Output Layer: The output layer is a one hot vector 
representing the class of the given input image. It there¬ 
fore has the dimensionality of the number of classes. The 
resulting class for the output vector x is: 

C{x) = {i I ^ i : Xj < Xi} (5) 

Softmax Layer: The error is propagated back over 
a Softmax layer. Let N be the dimension of the input 
vector, then Softmax calculates a mapping such that: 

S{x):R^ ^ [0,1]^ 

For each component 1 < j < A^, the output is 
calculated as follows: 

= ( 6 ) 

4 Datasets 

4.1 MMI Dataset 

The MMI dataset has been introduced by Pantic et 
al. [16] contains over 2900 videos and images of 75 per¬ 
sons. The annotations contain action units and emotions. 
The database contains a web-interface with an integrated 
search to scan the database. The videos/images are 
colored. The people are of mixed age, different gender 
and have different ethnical background. The emotions 
investigated are the six basic emotions: Anger, Disgust, 
Fear, Happiness, Sadness, Surprise. 

4.2 CKP Dataset 

This dataset has been introduced by Lucey et al. [12]. 
210 persons, aged 18 to 50, have been recorded depicting 
emotions. This dataset presented by contains recordings 
of emotions of 210 persons at the ages of 18 to 50 years. 
Both female and male persons are present and from 
different background. 81% are Euro-Americans and 13% 
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Anger Disgust Fear Happy Sadness Surprise 

Fig. 2: This Figure shows the differences within the MMI 
dataset. The six used emotions are listed. 


are Afro-Americans. The images are of size 640x490 px 
as well 640x480 px. They are both grayscale and colored. 
In total this set has 593 emotion-labeled sequences. The 
emotions consist of Anger, Disgust, Fear, Happiness, Sad¬ 
ness, Surprise, and Contempt. 

4.3 Comparison 

In the MMI Dataset (Fig. 2) the emotion Anger is 
displayed in different ways, as can be seen by the 
eyebrows, forehead and mouth. The mouth in the lower 
image is tightly closed while in the upper image the 
mouth is open. For Disgust the differences are also 
visible, as the woman in the upper picture has a much 
stronger reaction. The man depicting Fear has contracted 
eyebrows which slightly cover the eyes. On the other 
hand the eyes of the woman are wide open. As for 
Happy both persons are smiling strongly. In the lower 
image the woman depicting Sadness has a stronger lip 
and chin reaction. The last emotion Surprise also has 
differences like the openness of the mouth. 

Such differences also appear in the CKP set (Fig. 3). 
For Anger the eyebrows and cheeks differ. For Disgust 
larger differences can be seen. In the upper picture not 
only the curvature of the mouth is stronger, but the nose 
is also more involved. While both women displaying 
Fear show the same reaction around the eyes the mouth 
differs. In the lower image the mouth is nearly closed 
while teeth are visible in the upper one. Happiness is 
displayed similar. For the emotion Sadness the curvature 
of the mouth is visible in both images, but it is stronger 
in the upper one. The regions around the eyes differ as 
the eyebrows of the woman are straight. The last emotion 
Surprise has strong similarities like the open mouth an 
wide open eyes. Teeth are only displayed by the woman 
in the upper image. 

Thus for a better evaluation it is helpful to investigate 
multiple datasets. This aims at investigating whether the 
proposed approach works on different ways emotions 
are shown and whether it works on different emotions. 
For example Contempt which is only included in the CKP 
set. 



Anger Disgust Fear Happy Sadness Surprise 

Fig. 3: This Figure shows the differences within the 
Cohn-Kanade Plus (CKP) dataset. The emotion Contempt 
is not shown since there is no annotated image with 
the emotion being depicted, which is allowed to be 
displayed. 



Fig. 4: This is the proposed architecture. The main 
component of this architecture is the FeatEx block. In 
the Convolutional layer S depicts the Stride and P the 
Padding. 
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5 Proposed Architecture 

The proposed deep Convolutional Neural Network ar¬ 
chitecture (depicted in Figure 4) consists of four parts. 
The first part automatically preprocesses the data. This 
begins with Convolution 1, which applies 64 different 
filters. The next layer is Pooling 1, which down-samples 
the images and then they are normalized by LRN 1. The 
next steps are the two FeatEx (Parallel Feature Extraction 
Block) blocks, highlighted in Figure 4. They are the 
core of the proposed architecture and described later in 
this section. The features extracted by theses blocks are 
forwarded to a fully connected layer, which uses them 
to classify the input into the different emotions. 

The described architecture is compact, which makes it 
not only fast to train, but also suitable for real-time 
applications. This is also important as the network was 
built with resource usage in mind. 


TABLE 1: This Table lists the different output sizes 
produced by each layer. 


Layer 

Output Size 

Data 

Convolution 1 
Pooling 1 

LRN 1 

224 X 224 

64 X 112 X 112 

64 X 56 X 56 

64 X 56 X 56 

Convolution 2a 
Convolution 2b 
Pooling 2a 
Convolution 2c 
Concat 2 

Pooling 2b 

96 X 56 X 56 

208 X 56 X 56 

64 X 56 X 56 

64 X 56 X 56 

272 X 56 X 56 

272 X 28 X 28 

Convolution 3a 
Convolution 3b 
Pooling 3a 
Convolution 3c 
Concat 3 

Pooling 3b 

96 X 28 X 28 

208 X 28 X 28 

272 X 28 X 28 

64 X 28 X 28 

272 X 28 X 28 

282 X 14 X 14 

Classifier 

11 X 1 X 1 


FeatEx: The key structure in our architecture is the 
Parallel Feature Extraction Block (EeatEx). It is inspired 
by the success of GoogleNet. The block consists of 
Convolutional, Pooling, and ReLU Layers. The first 
Convolutional layer in EeatEx reduces the dimension 
since it convolves with a filter of size 1 x 1. It is 
enhanced by a ReLU layer, which creates the desired 
sparseness. The output is then convolved with a filter 
of size 3 X 3. In the parallel path a Max Pooling layer 
is used to reduce information before applying a CNN 
of size 1x1. This application of differently sized filters 
reflects the various scales at which faces can appear. 
The paths are concatenated for a more diverse 
representation of the input. Using this block twice 
yields good results. 


Visualization: The different layers of the architecture 
produce feature vectors as can be seen in Pig 5. The 
first part until LRN 1 preprocesses the data and creates 
multiple modified instances of the input. These show 
mostly edges with a low level of abstraction. The first 


EeatEx block creates two parallel paths of features with 
different scales, which are combined in Concat 2. The 
second EeatEx block refines the representation of the 
features. It also decreases the dimensionality. 

This visualization shows that the concatenation of EeatEx 
blocks is a valid approach to create an abstract feature 
representation. The output dimensionality of each layer 
can be seen in Table 1. 



Convolution 2b 


Convolution 2a 


LRN 1 


Pooling! 



Convolution 2c 

Pooling 2a 


Convolution 1 



Data 



Pig. 5: This Pigure shows example visualizations of the 
different layers. The data is taken from the MMI set. 


6 Experiments and Results 

As implementation Caffe [17] was used. This is a deep 
learning framework, maintained by the Berkeley Vision 
and Learning Center (BVLC). 

CKP: The CKP database has been analyzed often 
and many different approaches have been evaluated 
in order to "solve" this set. To determine whether the 
architecture is competitive, it has been evaluated on the 
CKP dataset. Por the experiments all 5870 annotated 
images have been used to do a 10-fold cross-validation. 
The proposed architecture has proven to be very effective 
on this dataset with an average accuracy of 99.6%. In 













































Table 2 different results from state of the art approaches 
are listed as comparison. The 100% accuracy reported by 
Zafar [14] is based on hand picked images. The results 
are not validated using cross-validation. The confusion 
matrix in Fig. 6a depicts the results and shows that some 
emotions are perfectly recognized. 

MMI: The MMI Database contains videos of peo¬ 
ple showing emotions. From each video the 20 frames, 
which represent the content of the video the most, have 
been extracted fully automatically. The first two of these 
frames have been discarded since they provide neutral 
expressions. 

To determine the frames, the difference between 
grayscale consecutive frames was calculated. To com¬ 
pensate noise the images have been smoothed using a 
Gaussian filter before calculation. To find the 20 most 
representative images, changes which occur in a small 
timeframe, should only be represented by a single image. 
This was achieved by iterating over the differences using 
a maximum filter with decreasing filter size until 20 
frames have been found. In total 3740 images have been 
extracted. 

The original images were then used for training and 
testing. A 10-fold cross-validation has been applied. 
The average accuracy is 98.63%. This is better than the 
accuracies achieved by Wang and Yin [19] (Table 3). To 
our knowledge they have been the only ones to evaluate 
the MMI database on Emotions instead of Action Units. 
The results of the proposed approach are depicted in the 
Confusion Matrix in Fig. 6b. In the figure it is shown that 
the accuracy for Fear is the lowest with 93.75% while 
Happiness is almost perfectly recognized with 98.21%. 
Fear and Surprise are the emotions confused the most. 

7 Discussion 

The accuracy on the CKP set shows that the chosen 
approach is robust, misclassification usually occurs on 
pictures which are the first few instances of an emotion 
sequence. Often a neutral facial expression is depicted 
in those frames. Thus those misclassifications are not 
necessarily an error in the approach, but in the data 
selection. Other than that no major problem could be 

TABLE 2: The CKP database has been very well an¬ 
alyzed and the best possible recognition accuracy has 
been achieved by Aliya Zafar. It is noteworthy that the 
samples he used for training are not randomly selected 
and no cross-validation has been applied. Evaluating this 
database provides information whether the proposed 
approach can compete with those results. 


Author 

Method 

Accuracy 

Aliya Zafar [14] 

NCC 

100% 

Happy et al. [4] 

Facial Patches + SVM 

94.09% 

Lucey et al. [12] 

AAM + SVM 

> 80% 

Song et al. [18] 

ANN (CNN) 

99.2% 

DeXpression(Proposed) 


99.6% 


TABLE 3: This Table summarizes the current state of 
the art in emotion recognition on the MMI database 
(Section 4.1). 


Author 

Method 

Accuracy 

Wang and Yin [19] 

LDA 

93.33% 

Wang and Yin [19] 

QDC 

92.78% 

Wang and Yin [19] 

NBC 

85.56% 

DeXpression (Proposed) 


98.63% 


detected. The emotion Surprise is often confused with 
Disgust with a rate of 0.045% which is the highest. Of 
those images, where an emotion is present, only few 
are wrongly classified. 

As there is no consent for the misclassified images, 
they cannot be depicted here. However some unique 
names are provided. 

Image S119_001_00000010 is classified as Fear while 
the annotated emotion corresponds to Surprise. The 
image depicts a person with a wide open mouth and 
open eyes. Pictures representing Surprise are often very 
similar, since the persons also have wide open mouths 
and eyes. In image S032_004_00000014 the targeted 
label Fear is confused with Anger. While the mouth 
region in pictures with Anger differ, the eye regions are 
alike, since in both situations the eyes and eyebrows are 
contracted. 

Similar effects are experienced when dealing with the 
MMI Dataset. Since the first two frames are discarded 
most pictures with neutral positions are excluded. In 
few images a neutral position can still be found which 
gives rise to errors. For the same reason as the CKP set 
images will not be displayed. Due to the approach to 
extract images of the videos, a unique identifier for the 
misclassified image cannot be provided. 

The top confusions are observed for Fear and Surprise 
with a rate of 0.0159% where Fear is wrongly 
misclassified as Surprise. Session 1937 shows a woman 
displaying Fear but it is classified as Surprise. Both 
share common features like similar eye and mouth 
movement. In both emotions, participants move the 
head slightly backwards. This can be identified by 
wrinkled skin. The second most confusion rate. Surprise 
being mistaken as Sadness, is mostly based on neutral 
position images. Although the first two images are 
not used, some selected frames still do not contain an 
emotion. In Session 1985 Surprise is being mistaken as 
Sadness. The image depicts a man with his mouth being 
slightly curved, making him look sad. 

DeXpression extracts features and uses them to clas¬ 
sify images, but in very few cases the emotions are 
confused. This happens, as discussed, usually in pictures 
depicting no emotion. DeXpression performs very well 
on both tested sets, if an emotion is present. 
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Recognition Accuracy: 0.996 

(a) The confusion matrix of the averaged 10-fold cross- 
validation on the CKP Dataset. The lowest accuracy is 
achieved by the emotion Surprise with 98.79% while Con¬ 
tempt/Sadness are both recognized with 100%. 

8 Conclusion and Future Work 

In this article DeXpression is presented which works 
fully automatically. It is a neural network which has 
little computational effort compared to current state of 
the art CNN architectures. In order to create it the 
new composed structure FeatEx has been introduced. It 
consists of several Convolutional layers of different sizes, 
as well as Max Pooling and ReLU layers. FeatEx creates 
a rich feature representation of the input. 

The results of the 10-foId cross-validation yield, in av¬ 
erage, a recognition accuracy of 99.6% on the CKP 
dataset and 98.36% on the MMI dataset. This shows that 
the proposed architecture is capable of competing with 
current state of the art approaches in the field of emotion 
recognition. 

In Section 7 the analysis has shown, that DeXpression 
works without major mistakes. Most misclassifications 
have occurred during the first few images of an emotion 
sequence. Often in these images emotions are not yet 
displayed. 

Future Work: An application built on DeXpression 
which is used in a real environment could benefit from 
distinguishing between more emotions such as Nervous¬ 
ness and Panic. Such a scenario could be large events 
where an early detection of Panic could help to pre¬ 
vent mass panics. Other approaches to enhance emotion 
recognition could be to allow for composed emotions. 
For example frustration can be accompanied by anger, 
therefore not only showing one emotion, but also the 
reason. Thus complex emotions could be more valuable 
than basic ones. Besides distinguishing between different 
emotions, also the strength of an emotion could be con¬ 
sidered. Being able to distinguish between different lev¬ 
els could improve applications, like evaluating reactions 



Recognition Accuracy: 0.9836 

(b) The confusion matrix of the averaged 10-fold cross- 
validation on the MMI Dataset. The lowest accuracy is 
achieved by the emotion Fear with 93.75%. Happiness is 
recognized with 98.21%. 


to new products. In this example it could predict the 
amount of orders that will be made, therefore enabling 
producing the right amount of products. 
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